Prevent crashes at exit#1287
Conversation
|
|
||
| _global_loop = None | ||
| atexit.register(partial(_cleanup, _global_loop)) | ||
| atexit.register(lambda: _cleanup(_global_loop)) |
There was a problem hiding this comment.
If lambdas are not of taste
| atexit.register(lambda: _cleanup(_global_loop)) | |
| @atexit.register | |
| def _(): | |
| global _global_loop | |
| _cleanup(_global_loop) |
|
Thanks for the PR @KowalskiThomas! We just got 3.30.0 out the door but I'm hoping to get this fix into our next release (tentatively 3.31.0). I'll have a review coming soon-ish... I'm still just a bit behind after finishing up the release so it may be just a little while yet. |
|
Sounds great, thank you! |
|
we were hitting this one for years, we nailed it at our end, not that long ago (😭 and now I see the fix wasn't merged) |
|
Thanks for the PR @KowalskiThomas! All makes sense to me conceptually and the changes seem quite sane. As mentioned by @fruch the fact that it matches up nicely to the proposed change on the Scylla driver makes me feel even better. I'd like to run this one through a round of testing on our Jenkins instance just to make sure there isn't some weird unintended side-effect but I don't expect much to show up there... it's hard to imagine why this would be a problem. I'll provide an update after I get this running on Jenkins. |
What is this PR?
This PR attempts to fix a bug we are seeing in our services and CI and that we have identified as coming from the Python Cassandra driver (
cassandra-python-driveris the only Python package in our environments that haslibevas a dependency).Stack traces typically look like the following, and crashes happen around exit.
This looks a lot like a problem we have identified in our own https://github.com/DataDog/dd-trace-py repository, where we need to make sure the interpreter hasn't finalised before trying to acquire the GIL -- otherwise it's not legal to do so and it typically just crashes.
However I think in the case of this code, this may actually be caused by another issue: the current
atexithook is invalid.libevreactor.pydoes this:This "generates" the
atexithook function when it is called so with the current_global_loopvalue as opposed to evaluating_global_loopat exit.As an illustration, take this:
Running this prints
Noneand notsomething.The right way to do this (if we want to keep it at the top level) would be to use a lambda:
... which does print
somethingat exit.My proposed fix is thus in two parts:
atexithook registration frompartialto alambda_cleanupasks the thread to stop so it shouldn't keep on trying to execute CPython code -- the fact thatjoinis called with a timeout of 1 second means it can fail, in which case the existing code is not safe (this is why we need the same mechanism inddtrace).Let me know if that makes sense!